Abstract: Natural language is robust against noise. The meaning
of many sentences survives the loss of words, sometimes
many of them. Some words in a sentence, however, cannot be
lost without changing the meaning of the sentence. We call these
words “wheat” and the rest “chaff”. The word “not” in the
sentence “I do not like rain” is wheat and “do” is chaff. For
human understanding of the purpose and behavior of source
code, we hypothesize that the same holds. To quantify the extent
to which we can separate code into “wheat” and “chaff”, we
study a large (100M LOC), diverse corpus of real-world projects
in Java. Since methods represent natural, likely distinct units of
code, we use the ~9M Java methods in the corpus to approximate
a universe of “sentences.” We “thresh”, or lex, functions, then
“winnow” them to extract their wheat by computing the function’s
minimal distinguishing subset (MINSET). Our results confirm that
programs contain much chaff. On average, minsets have 1.56
words (none exceeds 6) and comprise 4% of their methods.
Beyond its intrinsic scientific interest, our work offers the first
quantitative evidence for recent promising work on keyword-based
programming and insight into how to develop powerful,
alternative programming systems.

I. Introduction 

Words are the smallest meaningful units in most languages. 
We group them into sentences and sentences into paragraphs and 
paragraphs into novels and technical papers like this one. Some 
words in a sentence are more important to its meaning than 
the others. Indeed, from a few distinctive words in a sentence, 
we can often guess the meaning of the original sentence. 

This paper studies whether this intuitive observation about 
the importance of some words to the meaning of sentences in 
a natural language also holds for programming languages. 

This work follows recent, seminal studies on the “uniqueness”
[9] and the “naturalness” [12] of code. We study a different
dimension: the “essence” of code as captured in its syntax
and amenable to human interpretation. Our study is inspired by
recent work on keyword-based programming [14, 16, 17, 23].
Keyword programming is a technique that translates keyword
queries into Java expressions [16]. Sloppy programming is a
general term that describes several tools and techniques that
interpret, via translation to code, keyword queries [17, 23].
SmartSynth [14], another notable tool, combines techniques
from natural language processing and program synthesis to
generate scripts for smartphones from natural language queries.

This promising, new programming paradigm rests on the 
untested assumption that 1) small sets of distinctive keywords 
characterize code and 2) humans can produce them. Our work 
is the first to provide quantitative and qualitative evidence 
to validate this assumption. We show the existence of small 
distinctive sets that characterize code, establishing a necessary 
condition of this paradigm that allows programmers to write
code naturally and easily using keyword queries, alleviating
syntactic frustration.

We focus our study on a diverse corpus of real-world Java 
projects with 100M lines of code. The approximately 9M 
Java methods in the corpus form our universe of discourse as 
methods capture natural, likely distinct units of source code. 
Against this corpus, we compute a minimal distinguishing
subset (MINSET) for each method. This MINSET is the wheat
of the method and the rest is chaff. We develop procedures
for “threshing” functions via lexing and “winnowing” them
by computing their MINSETS. A lexicon is a set of words.
Like web search queries, MINSETS are built from words in a 
lexicon. We run our algorithms over different lexicons, ranging 
from raw, unprocessed source tokens to various abstractions 
of those tokens, all in a quest to find a natural, expressive 
and meaningful lexicon that culminated in the discovery of a 
natural lexicon to use for queries (Section IV-B). 

Our results show programs do indeed contain a great deal 
of chaff. Using the most concrete lexicon, formed over raw 
lexemes, MINSETS compose only 4% of their methods on 
average. This means that about 96% of code is chaff. While 
the ratios vary and can be large, MINSETS are always small, 
containing, on average, 1.56 words, and none exceeds 6. We 
observed the same trend over other lexicons. Detailed results are 
in Section IV. Section V also discusses existing and preliminary 
applications of our work. Our project web site
(http://jarvis.cs.ucdavis.edu/code_essence) also contains more
information on this work, and interested readers are invited to
explore it.

While our work is not code search, the results have direct 
implications in that area because they provide evidence that 
addresses an assumption of code search: humans can efficiently 
search for code. This assumption is closely related to the second 
part of the assumption on which keyword programming is based. 
Work on code search breaks the problem into three subproblems:
1) how to store and index code [2, 20], 2) what queries (and
results) to support [27, 28], and 3) how to filter and rank the
results [2, 18, 21]. The programmer’s only concern is “What
do I need to type to find the code I want?”. We take a step
back and ask, “Is there anything you can type?”, and answer,
“Yes, a MINSET.”

Our main contributions follow: 

• We define and formalize the MINSET problem for rigorously
testing the “wheat” and “chaff” hypothesis (Section II-B);

• We prove that MINSET is NP-hard and provide a greedy
algorithm to solve it (Section II-C);

• We validate our central hypothesis, that source code contains
much chaff, against a large (100M LOC), diverse corpus
of real-world Java programs (Section IV); and

• We design and compare various lexicons to find one that
is natural, expressive, and understandable (Section IV-B).

The rest of this paper is organized as follows. Section II
describes threshing and winnowing source code. Section III
describes our Java corpus, and implementations of the function
thresher and winnowing tool (MINSET algorithm). Section IV
presents our detailed quantitative and qualitative results.
Section VI analyzes our results and their implications. Section VII
places our work into the context of related work, and
Section VIII concludes.

II. Problem Formulation 

After harvesting, farmers thresh and winnow the wheat. 
Threshing is the process of loosening the grain from the chaff 
that surrounds it. Winnowing is the process of separating the 
grain or kernels from the chaff. In this section, we define “wheat” 
and “chaff”, describe code threshing, and present MINSET, our
winnowing algorithm. 

A. Threshing 

We view functions as the “stalks of wheat”. Functions are 
natural, likely distinct, units of code and functionality. One 
could also choose other units like individual statements, blocks, 
or classes. This granularity seems adequate. Functions are 
usually the building blocks of more complex components. To 
thresh, we parse a function to get its set of lexemes. Then, we 
map this set of lexemes to a set (or bag) of “words”. 

What is a “word”? We are free to define the lexicon, the 
set of (allowed) words. A natural, basic lexicon is the set of 
lexemes; a lexeme is a delimited string of characters in code, 
where space and punctuation are typical delimiters; it is an 
atomic syntactic unit in a programming language.¹ Under this
lexicon, words are lexemes. New lexicons can be formed by
abstraction over lexemes. In natural languages, for example, the
words in a sentence can be replaced by their part of speech, like
Noun, Verb, or Adjective, to highlight structure. Similarly,
code parsers tag each lexeme with one of a set of token types.
Thus, another natural, but more abstract, lexicon consists of
token types. New lexicons can also be defined by filtering
specific lexemes. For example, we can allow all lexemes except
delimiters, like ‘(’ and ‘)’. Under this lexicon, a function’s set
would be all its lexemes except the delimiters.
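As a rough illustration of threshing under two lexicons, the sketch below splits a code fragment into unique lexemes and then abstracts each lexeme to a coarse token type. It is a minimal sketch under assumptions: the class `ThreshSketch`, its regular expression, its short keyword list, and its token-type names are all hypothetical, and a regular expression is only a crude stand-in for a real Java lexer (it ignores comments, string literals, and most operators).

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Crude regex thresher; a hypothetical stand-in for a real Java lexer. */
class ThreshSketch {
    // Identifiers or keywords, integer literals, a few two-character
    // operators, then any other single non-space character.
    private static final Pattern LEXEME = Pattern.compile(
        "[A-Za-z_$][A-Za-z0-9_$]*|\\d+|\\+\\+|--|==|!=|<=|>=|&&|\\|\\||\\S");

    // Small, incomplete keyword list chosen for illustration only.
    private static final Set<String> KEYWORDS = Set.of(
        "int", "for", "if", "else", "while", "return", "void", "static", "private");

    /** Threshes source text into its set of unique lexemes, in first-seen order. */
    static Set<String> threshLexemes(String source) {
        Set<String> words = new LinkedHashSet<>();
        Matcher m = LEXEME.matcher(source);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    /** Abstracts a lexeme to an assumed, coarse token-type name. */
    static String tokenType(String lexeme) {
        if (lexeme.matches("\\d+")) return "INTLIT";
        if (KEYWORDS.contains(lexeme)) return lexeme.toUpperCase(Locale.ROOT);
        if (lexeme.matches("[A-Za-z_$][A-Za-z0-9_$]*")) return "IDENTIFIER";
        return lexeme; // operators and delimiters act as their own type
    }

    /** Threshes source text into its set of unique token types. */
    static Set<String> threshTokenTypes(String source) {
        Set<String> types = new LinkedHashSet<>();
        for (String lexeme : threshLexemes(source)) {
            types.add(tokenType(lexeme));
        }
        return types;
    }

    public static void main(String[] args) {
        String code = "int a = 0; int b = 1;";
        System.out.println(threshLexemes(code));    // [int, a, =, 0, ;, b, 1]
        System.out.println(threshTokenTypes(code)); // [INT, IDENTIFIER, =, INTLIT, ;]
    }
}
```

The token-type set is smaller than the lexeme set because distinct identifiers and literals collapse to the same word, which is the effect of moving to a more abstract lexicon.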

Figure 1 illustrates the threshing process. It shows the source 
code of a Java method that sorts numbers using bubble sort. It
also shows the threshed function using a lexicon consisting of 
all raw lexemes, and a lexicon consisting only of lexer token 
types. 

Varying the lexicon allows us to explore programming 
language-specific information. The lexicon consisting of all 
lexemes probably includes many elements that we suspect 
have little to do with the behavior of functions, i.e., delimiters 
and string literals like "Joe". We can filter those lexemes, by 
not scooping them into the winnowing screen. We can also 
filter other lexemes, like the type annotation “int” in “int 
cars = 0;”, to explore how important they are in the model. 
Functions may also contain, to adapt a word from linguistics,
homonyms: identical lexemes with distinct effects on behavior.
For example, in Java, the lexeme “get” could be a method call
of “java.util.Map.get()” or “java.util.List.get()”. In Java,
we fully qualify homonyms to distinguish them as shown.

¹ Linguistics defines a lexeme differently. A lexeme is the set of forms a
single word can take. For example, ‘run’, ‘runs’, ‘running’ are all forms of
the same lexeme identified by the word ‘run’.

    /* Standard BubbleSort algorithm.
     * @param array The array to sort.
     */
    private static void bubbleSort(int array[]) {
        int length = array.length;
        for (int i = 0; i < length; i++) {
            for (int j = 1; j < length - i; j++) {
                if (array[j - 1] > array[j]) {
                    int temp = array[j - 1];
                    array[j - 1] = array[j];
                    array[j] = temp;
                }
            }
        }
    }

[Figure panel: Threshed Function (23 words; all unique lexemes)]

[Figure panel: Threshed Function (18 words; all unique lexer token types)]

Fig. 1: The top part shows a Java method that implements the
Bubble Sort algorithm. The bottom part shows two threshing
results. In the first set, we keep all (unique) lexemes. In the
second set, we map each lexeme to its lexer token type. Note
that, since some lexemes map to the same lexer token type,
the second set is smaller.

In general, we can map lexemes to distinct words to capture 
the difference in behavior. We can also abstract distinct lexemes 
we suspect have the same effect on behavior, i.e. synonyms, to 
the same word. For example, variable identifiers can be replaced 
with their type under a language’s type system. In general, 
a lexicon that is fine-grained and concrete may exaggerate 
unimportant differences between functions, while one that is 
coarse and abstract may blur important differences. At both 
ends of the spectrum of lexicons, it may be difficult to separate 
the grain from the chaff later. 

B. Winnowing 

In threshing, we simplified the representation of a function 
by mapping its source code to a set of lexical features, words. 
Finding the wheat of a function is thus reduced to finding a
unique subset of code features. This unique subset distinguishes
each function from all other functions (when all functions
are represented as sets of words). We call any such subset a
distinguishing subset, and define it precisely in Definition II.1.
We call the problem of finding the minimum distinguishing
subset (MINSET) the MINSET problem.

Definition II.1. Given a finite set $S$, and a finite collection of
finite sets $\mathcal{C}$, $S^*$ is a distinguishing subset of $S$ if and only if

(P1) $S^* \subseteq S$ : $S^*$ is a subset of $S$

(P2) $\forall C \in \mathcal{C},\ S^* \not\subseteq C$ : $S^*$ is only a subset of $S$
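The two properties translate directly into a membership check. The following is a minimal sketch, assuming sets are modeled with `java.util.Set`; the class and method names are hypothetical and the example reuses the problem instance of Fig. 2.

```java
import java.util.List;
import java.util.Set;

/** Checks the two properties of a distinguishing subset (hypothetical helper). */
class DistinguishingSubset {
    static <T> boolean isDistinguishing(Set<T> candidate, Set<T> s, List<Set<T>> collection) {
        // P1: the candidate must be a subset of S.
        if (!s.containsAll(candidate)) {
            return false;
        }
        // P2: the candidate must not be a subset of any set in the collection.
        for (Set<T> c : collection) {
            if (c.containsAll(candidate)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> s = Set.of("a", "b", "e");
        List<Set<String>> c = List.of(
            Set.of("a", "c"), Set.of("b", "c", "d"), Set.of("a", "d", "e"));
        System.out.println(isDistinguishing(Set.of("b", "e"), s, c)); // true
        System.out.println(isDistinguishing(Set.of("b"), s, c));      // false: {b} is inside {b,c,d}
    }
}
```

Note that the check says nothing about minimality; it only decides whether a given candidate separates S from every set in the collection.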

What is wheat and what is chaff in code? The wheat grain
of a piece of code is the MINSET. A MINSET identifies a
piece of code, wheat and chaff together. The elements of a
MINSET are distinguishing features, a kind of semantic core. The
MINSET, however, is not itself executable. Just as a wheat grain
depends on chaff to grow, a MINSET depends on its surrounding
context to execute and provide functionality. We call this
surrounding context chaff: it consists of the low-level
technological details of a programming language and a platform
that obscure the higher-level semantics of a function.

Algorithm 1: Given the universe $U$, the finite set $S$, and the
finite set of finite sets $\mathcal{C}$, MINSET has type
$2^U \times 2^{2^U} \to 2^U \times 2^{2^U}$
and its application MINSET($S$, $\mathcal{C}$) computes 1) $S^* \subseteq S$, a subset
that distinguishes $S$ from sets in $\mathcal{C}$, and 2) $\mathcal{C}'$, a “remainder”, i.e.
a subset of $\mathcal{C}$ whose sets contain $S$ and therefore from which
$S$ could not be distinguished; when $\mathcal{C}' = \emptyset$, $S^*$ distinguishes $S$
from all the sets in $\mathcal{C}$; when $\mathcal{C}' = \mathcal{C}$, $S^* = \emptyset$.

Input: $S$, the set to minimize.
Input: $\mathcal{C}$, the collection of sets against which $S$ is minimized.
1: $C_e = \{C \mid C \in \mathcal{C} \land e \in C\}$ are those sets in $\mathcal{C}$ that contain $e$.
2: $S^* = \emptyset$
3: while $S \neq \emptyset \land \mathcal{C} \neq \emptyset$ do
     // Greedily pick an element that most differentiates S.
4:   $e := \mathrm{CHOOSE}(\{x \in S \mid |C_x| \le |C_y|,\ \forall y \in S\})$
5:   if $C_e = \emptyset \lor C_e = \mathcal{C}$ break
6:   $S^* := S^* \cup \{e\}$
7:   $S := S \setminus \{e\}$
8:   $\mathcal{C} := C_e$
9: return $S^*, \mathcal{C}$

Step   S*       S          C                          CHOOSE
0      ∅        {a,b,e}    {{a,c},{b,c,d},{a,d,e}}    b (|C_b| = 1)
1      {b}      {a,e}      {{b,c,d}}                  e (|C_e| = 0)
2      {b,e}    {a}        ∅

Fig. 2: The execution of Algorithm 1 illustrated on the following
problem instance: MINSET({a,b,e}, {{a,c},{b,c,d},{a,d,e}}).

The MINSET problem. We now formally define the core
computational problem that we study.

Definition II.2 (The MINSET Problem). Given a finite set
$S$, and a finite collection of finite sets $\mathcal{C}$, find a minimum
distinguishing subset (minset) $S^*$ of $S$.

Theorem II.1. MINSET is NP-hard.

Proof: We reduce Hitting-Set to MINSET. ■

C. The MlNSET Algorithm 

Since the MINSET problem is NP-hard, we present Algorithm 1,
a greedy algorithm that finds a locally minimal
distinguishing subset of a set $S$. Given inputs $S$, the target set
to be minimized, and $\mathcal{C}$, a collection of sets against which $S$ is
minimized, the MINSET algorithm computes $S^*$ and $\mathcal{C}'$. $\mathcal{C}'$ is
the subset of $\mathcal{C}$ whose sets contain $S$, so $\mathcal{C} \setminus \mathcal{C}'$ contains those
sets in $\mathcal{C}$ that do not contain $S$. When $\mathcal{C}' = \emptyset$, $S^*$ is a subset
of $S$ that distinguishes $S$ from all sets in $\mathcal{C}$. The core of the
algorithm is Line 4. Equality is needed in the cardinality test for
cases like $S = \{a,b\}$, $\mathcal{C} = \{\{a,x\},\{a,y\},\{b,x\},\{b,y\}\}$, where
all the elements in $S$ differentiate $S$ from the same number
of sets in $\mathcal{C}$. Equality also means that $C_x$ can be empty, as
for $S = \{a\}$ and $\mathcal{C} = \{\{x\},\{y\}\}$, since $|C_a| \le |C_a| = 0$, and $C_x$
can also be $\mathcal{C}$ again, when $S \subseteq C,\ \forall C \in \mathcal{C}$, as in $S = \{a\}$ and
$\mathcal{C} = \{\{a\},\{a,b\},\{a,b,c\}\}$.
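The greedy procedure can be sketched in Java as follows. This is an illustrative sketch, not the authors' implementation: the class `GreedyMinset` and its `Result` type are hypothetical, ties are broken by taking the first minimal element in iteration order, and the stopping test is our reading of the trace in Fig. 2, where an element that eliminates every remaining set is still added to S* before the loop ends.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Greedy sketch of a minset computation over sets of words (hypothetical names). */
class GreedyMinset {
    /** Pair of results: the chosen subset S* and the remainder C'. */
    static final class Result<T> {
        final Set<T> minset;
        final List<Set<T>> remainder;

        Result(Set<T> minset, List<Set<T>> remainder) {
            this.minset = minset;
            this.remainder = remainder;
        }
    }

    static <T> Result<T> minset(Set<T> s, List<Set<T>> collection) {
        Set<T> remaining = new LinkedHashSet<>(s);     // working copy of S
        List<Set<T>> c = new ArrayList<>(collection);  // working copy of C
        Set<T> star = new LinkedHashSet<>();           // S*, initially empty

        while (!remaining.isEmpty() && !c.isEmpty()) {
            // Pick the element contained in the fewest remaining sets.
            T best = null;
            List<Set<T>> bestContaining = null;
            for (T x : remaining) {
                List<Set<T>> containing = new ArrayList<>();
                for (Set<T> set : c) {
                    if (set.contains(x)) {
                        containing.add(set);
                    }
                }
                if (bestContaining == null || containing.size() < bestContaining.size()) {
                    best = x;
                    bestContaining = containing;
                }
            }
            // If even the best element occurs in every remaining set,
            // no further narrowing is possible.
            if (bestContaining.size() == c.size()) {
                break;
            }
            star.add(best);
            remaining.remove(best);
            c = bestContaining;
        }
        return new Result<>(star, c);
    }

    public static void main(String[] args) {
        Set<String> s = new LinkedHashSet<>(Arrays.asList("a", "b", "e"));
        List<Set<String>> c = List.of(
            Set.of("a", "c"), Set.of("b", "c", "d"), Set.of("a", "d", "e"));
        Result<String> r = minset(s, c);
        System.out.println(r.minset + " " + r.remainder); // [b, a] []
    }
}
```

On the instance of Fig. 2 this sketch returns {b, a} rather than {b, e}, because both a and e occur in none of the remaining sets at the second step and the sketch takes the first one. Either choice yields a distinguishing subset of size 2 with an empty remainder.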

TABLE I: Corpus summary.

Repository     Projects    Files      Lines of Code
Apache         103         101,480    10,891,228
Eclipse        102         287,669    32,770,246
Github         170         133,793    13,752,295
Sourceforge    533         373,556    42,434,029
Total          908         896,498    99,847,798

Theorem II.2. Consider MINSET($S$, $\mathcal{C}$) = $S^*, \mathcal{C}'$. The $S^*$ that
Algorithm 1 computes distinguishes $S$ from a subset of $\mathcal{C}$; when
$\mathcal{C}' = \emptyset$, $S^*$ is a minimally distinguishing subset of $S$.

Proof: By induction on $S^*$. ■

The worst case complexity of MINSET($S$, $\mathcal{C}$) is $O(|S|^2 |\mathcal{C}|)$.
First, there are at most $|S|$ iterations and, in each iteration,
for each element $x \in S$, we need to 1) compute $C_x$, each at a
cost of $|\mathcal{C}|$, for a total cost of $O(|S||\mathcal{C}|)$, then 2) find the
minimum $|C_x|$ at a cost of $O(|S|)$. Of course, $S$ and $\mathcal{C}$ are smaller
in each iteration, but we ignore this and over-approximate. Thus, we
have $O(|S|(|S||\mathcal{C}| + |S|)) = O(|S|^2 |\mathcal{C}|)$.

As mentioned earlier, modeling functions as sets discards
differences in methods due to multiplicity. We have also
developed a multiset version of the MINSET algorithm, which
we omit due to lack of space.

III. Setup and Implementation 

We selected a very popular, modern programming language,
Java, and collected a large (100M lines of code), diverse corpus
of real-world projects. Ignoring scaffolding and very simple
methods, which we define as those containing fewer than 50
tokens, there are 1,870,905 distinct methods in our corpus.
We selected a simple random sample of 10,000 methods.² Our
software and data are available.³
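The footnoted confidence level and margin of error can be checked with the standard formula for a sample proportion. This is a sketch under assumptions the text does not state: a worst-case proportion $p = 0.5$, $z = 1.96$ for 95% confidence, and a finite population correction with $N = 1{,}870{,}905$ and $n = 10{,}000$.

```latex
\[
E \;=\; z \sqrt{\frac{p(1-p)}{n}} \, \sqrt{\frac{N-n}{N-1}}
  \;=\; 1.96 \sqrt{\frac{0.25}{10{,}000}} \, \sqrt{\frac{1{,}860{,}905}{1{,}870{,}904}}
  \;\approx\; 1.96 \times 0.005 \times 0.9973
  \;\approx\; 0.0098
\]
```

Under these assumptions the margin of error is about 0.98%, which is consistent with the stated ±1%.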

A. Code Corpus 

Over the summer of 2012, we downloaded almost one
thousand of the most popular projects from four widely-used
open source code repositories: Apache, Eclipse, Github, and
Sourceforge.

Curation. Since some projects in our corpus are hosted in
multiple code repositories, we removed all but the most recent 
copy of each project. Also, since many project folders contained 
earlier or alternative versions of the same project, and even 
other projects, where we could, we identified the main project 
and kept only its most current version. Table I summarizes 
our curated corpus. After curation, clones may still exist in 
the corpus, for example, within projects. A search program 
we wrote helps us find clones. When we compute minsets, we 
assume no clones remain. Our results in Section IV-A give us 
confidence that this is the case. 

Filtering Scaffolding Methods. Java, in particular, requires
that a programmer write many short scaffolding methods, for
example, getters and setters. Many languages, like Ruby and
Python, eliminate the need for such scaffolding code. After
manual inspection, we found that such methods usually contain
fewer than 50 tokens, or about 5 lines of code. This is consistent
with other research [3, 15] that also ignores shorter methods. At
this size, we also filter methods with very simple functionality.
After filtering, 905 out of 908 projects are still represented.
Table II shows the method counts.

² Given the population size, this gives us a confidence level of 95%, and a
margin of error of ±1%.

³ https://bitbucket.org/martinvelez/code_essence_dev/downloads.

TABLE II: Method counts.

Methods                       Count
Total (in corpus)             8,918,575
Unique                        8,135,663
Unique (50 or more tokens)    1,870,905
Unique (50 to 562 tokens)     1,801,370

B. The Function Thresher 

We developed a tool, which we call JavaMT, that threshes
all the functions in our corpus. JavaMT leverages the Eclipse
JDT parser, which parses Java code and builds the syntax tree.⁴
JavaMT can take as input .java, .class, and .jar files.
Projects can contain these and other types of files. The tool
builds a list of tokens for each method. It collects the lexeme
of each token and additional information as it traverses the
syntax tree.

To address the homonym problem, JavaMT collects the fully 
qualified method name (FQMN) for method name lexemes, and 
the fully qualified type name (FQTN) for variable identifiers 
and type identifiers. Collecting this information allows us later
to classify methods and types based on whether they are part of
the Java SDK library or local to specific projects.
When projects are missing dependencies, resolving names to 
either FQMN or FQTN may not be possible. In our corpus, we 
encountered this problem with 0.03% of the tokens. JavaMT 
can also collect more abstract information like lexer token 
types as defined in the javac implementation of OpenJDK, 
an open-source Java platform [26]. 

C. The Winnowing Tool 

All the information collected by JavaMT is stored in a 
PostgreSQL database. We developed a tool that runs MINSET
for each method and stores the result in the same database. If a 
method does not have a minset, it stores a list of methods that 
are strict supersets and a list of methods that are duplicates 
after threshing. 

IV. Results and Analysis 

Our core research question can be addressed in terms of 
absolute minset sizes, or in terms of minset ratios, minset size to 
threshed method size. While the minset sizes and minset ratios 
will almost undoubtedly vary across functions, we hypothesize 
that the mean minset size and the mean minset ratio are small 
— that there is a great deal of chaff in code. Our results show 
that code contains much chaff. 

The data we present and the database queries we used can
be downloaded from Bitbucket.³

⁴ http://www.eclipse.org/jdt/.


A. How Much of Code is Wheat? 

Cast in terms of wheat, our core research question — How 
much of code is wheat? — can be answered in two ways: in 
terms of size of minsets, or the ratio of minsets to their function. 
We report both. There are also two natural views we can take 
of code: the raw sequence of lexemes the programmer sees 
when writing and reading code, and the abstract sequence of 
tokens the compiler sees in parsing code. We want to explore 
those two views, and capture each one as a lexicon, a set of 
words. LEX is the set of all lexemes found in code (5,611,561 
words). LTT is the set of lexer token types defined by the 
compiler (101 words). Each word in LTT is an abstraction of 
a lexeme, like 3 into intlit. 

LEX is the primordial lexicon; all others are abstractions 
of its words. Unfortunately, it is noisy: it is sensitive to any 
syntactic differences, including typos or use of synonyms, so 
it tends to overstate the number of minsets and understate 
their sizes; spurious homonyms can have the opposite effect, 
but are unlikely in Java when one can employ fully qualified 
names. LTT is the minimal lexicon a parser needs to determine 
whether or not a string is in a language. We computed minsets 
with our winnowing tool of all the methods in our random 
sample of 10,000 using each lexicon, and display a summary 
of our results in Figure 3 and Figure 4. 

Using LEX, wheat is a tiny proportion of code. The minset of 
a method, on average, contains 4.57% of the unique lexemes in 
a method which means that methods in Java contain a significant 
amount of chaff, 95.43% on average. More surprisingly, the 
number of lexemes in a minset is also just plain small. The 
mean minset size is 1.55. The minset sizes also do not vary 
much. In 85.62% of the methods, one or two unique lexemes
suffice to distinguish the code from all others. The largest
minset consists of only 6 lexemes. Minset ratios also do not
vary much: 75% of all methods have a minset ratio of 6.35% or
smaller. While the ratios are sometimes large, the absolute sizes
never are. The method with the largest minset ratio, 33.3%,
for example, consists of 18 unique lexemes but has a minset
size of 6. The method with the second largest minset ratio,
29.41%, consists of 17 unique lexemes and
has a minset size of 5.

Minsets are surprisingly small; especially surprising is that the 
maximum size is small. One reason might be the compression 
inherent to representing functions as sets. We address this later 
when we experiment with multisets. To test the robustness 
of our results, we also focused our investigation on larger 
methods because they may encode more behavior and therefore 
have more information. Hence, they may have larger minsets. 
Selected uniformly at random, our sample set does not include
many of the largest methods: the largest method in our random
sample has 2,025 lines of code while the largest one in our
corpus contains 4,606 lines of code. To examine
minset properties conditioned on large methods, we
selected the 1,000 largest methods, by lines of source code,
and computed their minsets. The mean and maximum minset
sizes of the largest methods are slightly lower but similar to 
the previous sample, 1.12 and 4, respectively. This shows that 
minsets are small and potentially effective indices of unique 
information even for abnormally large methods. 

Using LTT, the proportion of wheat in code is larger but still
small. The minset of a method, on average, contains 18.45%
of the unique token types in a method. We observe again that
sometimes minset ratios can be large but the absolute minset
sizes never are. It is not surprising that the minset ratio is larger.
Information is lost in mapping millions of distinct lexemes to
only 101 distinct lexer token types. Information is also lost
as method sizes decrease from 42.7 using LEX to 18.2 using
LTT.

[Figure panel: 10,000 Random Sample Methods (LEX). x-axis: Method Size
(Threshed). Summary: Min 12, Mean 42.6, Median 35, Max 4004, Mode 28.]

Fig. 3: The histogram of minset sizes tells us that minsets are small. Comparing minset sizes with method sizes shows that
minsets are also relatively small. The minset ratio histogram confirms this.

Fig. 4: Random Sample of 10,000 Methods: (left) Proportion
of Methods with Minsets: There is a stark difference in that
proportion between LEX and LTT. (right) Proportion of
Methods with Duplicates: LEX induces very few duplicates
compared to LTT. LTT maps almost three quarters of the
methods to the same set as another. It is too coarse, and does
not thresh well.

These results show that code contains a lot of chaff, in 
relative and absolute terms. Given that we preserve a lot of 
information with LEX, we claim that the mean minset size, and 
mean minset ratios we found are approximate lower bounds. In 
essence, we can define a lexicon spectrum where LEX is one 
of the poles, and LTT is a more abstract point on the lexicon 
spectrum. 

The yield of a lexicon is its percentage of threshable methods.
Our exploration also shows that the yield decreases as the
lexicon becomes coarser, measured roughly by the number
of words. Our coarsest lexicon, LTT, does not loosen the
grains from the chaff well. Its coarseness seems to cause 6,640
methods to be threshed to the same set as another. Only 87
out of 10,000 methods, 0.87%, have a minset using LTT. In
contrast, LEX appears to preserve sufficient information so
that 9,087 out of 10,000 methods have a minset.

TABLE III: Candidate Lexicons.

Name            MIN1      MIN2      MIN3      MIN4
Size (words)    55,543    55,556    91,816    91,829

B. What is a Natural, Minimal Lexicon? 

We have shown that a method can be threshed and winnowed 
to a small minset over LEX and LTT. Raw lexemes and 
token types are cryptic. We also want to determine whether we 
can thresh and winnow a method to a small and meaningful 
minset. By meaningful, we refer to how much information a 
minset reveals about functionality and behavior to us, humans. 
By the definition of minset, what they reveal should also be 
distinguishing. 

We address this question by exploring the lexicon spectrum 
toward more abstract views of code. Our challenge is to find a 
lexicon that differentiates methods while being sufficiently small 
to be easily understandable and useful for humans. In short,
we seek here to approximate the set of words a programmer
might use to search for or synthesize code. We additively
construct a bag of words a programmer might naturally use.

Two issues confound this search: lexicon specialization
can overfit while lexicon abstraction introduces imprecision.
To ameliorate overfitting, we restricted our search to natural
lexicons. By natural, we mean simple and intuitive. We pursue
natural abstractions to avoid unnatural abstractions that overfit
our corpus, like one that maps every function in our corpus
to a unique meaningless word. In our context, imprecision
leads to spurious homonyms, which reduce yield.⁵ To handle
this problem, we relax the definition of threshability to
k-threshability: a method is k-threshable if its minset has k or
fewer supersets. Henceforth, when we say threshable we mean
10-threshable. We chose 10 because that is consistent with what
humans can process in a glance or two. Humans can rapidly
process short lists [22].

⁵ Although LEX is rife with synonyms, our candidate lexicons have almost
none.

[Figure legend: Have_minsets, Have_10_or_fewer_supersets,
Have_more_than_10_supersets; Do_not_have_duplicates, Have_duplicates.
x-axis: Lexicon (MIN1, MIN2, MIN3, MIN4).]

Fig. 5: (left) As the lexicon grows from MIN1 to MIN4, the
average size of the threshed methods also grows. (right) As the
lexicon grows, the average minset size hardly changes. At least
three quarters of the methods have a minset smaller than 4.
Even as the lexicon grows, the maximum minset size is never
more than 10.

Fig. 6: (left) Yield: The yield clearly improves with each
change. At MIN4, the yield is 44.79%. (right) Proportion
of Methods With Duplicates: Using this proportion as a rough
gauge of threshing precision, there is a substantial improvement
in threshing precision with each lexicon: fewer methods have
duplicates. MIN4 pushes that precision past 50%.
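The relaxed definition and the yield measure can be sketched as follows. The class name, method names, and the example superset counts are hypothetical; the sketch assumes that the number of supersets of each method's minset (the size of its remainder) has already been computed.

```java
import java.util.List;

/** Illustrative yield computation under k-threshability (hypothetical names). */
class YieldSketch {
    /** A method is k-threshable if its minset has k or fewer supersets. */
    static boolean isKThreshable(int supersetCount, int k) {
        return supersetCount <= k;
    }

    /** Yield: the percentage of methods that are k-threshable. */
    static double yieldPercent(List<Integer> supersetCounts, int k) {
        if (supersetCounts.isEmpty()) {
            return 0.0;
        }
        int threshable = 0;
        for (int count : supersetCounts) {
            if (isKThreshable(count, k)) {
                threshable++;
            }
        }
        return 100.0 * threshable / supersetCounts.size();
    }

    public static void main(String[] args) {
        // Made-up superset counts for five methods.
        List<Integer> counts = List.of(0, 0, 3, 12, 40);
        System.out.println(yieldPercent(counts, 0));  // 40.0 (strict threshability)
        System.out.println(yieldPercent(counts, 10)); // 60.0 (10-threshability)
    }
}
```

With k = 0 the measure reduces to the strict notion of having a minset; raising k to 10 admits methods whose minsets narrow the corpus to a short list.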

We considered four lexicons. Table III shows their names and 
sizes. Our results appear in Figure 5 and Figure 6. We focused 
on the absolute minset size. In searching or synthesizing code 
using minsets, the minset size is likely more important to 
the programmer than the minset ratio. We also focused on 
yield, the proportion of threshable methods. It approximates 
the proportion of methods a programmer can synthesize or 
search for using a given lexicon. Broadly, it gives us a sense 
of the effectiveness and usefulness of a programming model 
involving minsets. 

First, we considered MIN1, a lexicon including only method
names and operators. For public API methods, we used fully
qualified method names to prevent the spurious creation of
homonyms. For local methods, we abstracted all names to a
single abstract word to capture their presence. Local methods
tend to implement project-specific functionality not provided
by the public API, and are not generally intended for wider
use. The intuition in including method names is that a lot of
the semantics is captured in method calls. They are the verbs
or action words of program sentences. Our intuition is further
supported by the effectiveness of API birthmarking [31]. We
also included operators because all primitive program semantics
are applications of operators. Using this lexicon, the mean and
maximum minset sizes are small, 2.73 and 7, respectively.
The imprecision of MIN1 manifests itself in the low yield of
26.86%.
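The mapping that such a lexicon induces can be sketched as a filter over already-classified tokens. This is a minimal sketch under assumptions: the `Token` type, the `Kind` categories, and the placeholder word `<local-method>` are hypothetical, and the classification of each token (public API method, local method, operator, other) is assumed to come from an earlier resolution step.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative projection of tokens onto a MIN1-style lexicon (hypothetical names). */
class Min1Sketch {
    enum Kind { API_METHOD, LOCAL_METHOD, OPERATOR, OTHER }

    static final class Token {
        final Kind kind;
        final String text; // fully qualified name for API methods, raw lexeme otherwise

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Single abstract word standing in for every project-local method name.
    static final String LOCAL_METHOD_WORD = "<local-method>";

    static Set<String> toMin1Words(List<Token> tokens) {
        Set<String> words = new LinkedHashSet<>();
        for (Token t : tokens) {
            switch (t.kind) {
                case API_METHOD:
                    words.add(t.text);            // keep the fully qualified method name
                    break;
                case LOCAL_METHOD:
                    words.add(LOCAL_METHOD_WORD); // record presence only
                    break;
                case OPERATOR:
                    words.add(t.text);            // keep operators as they are
                    break;
                default:
                    break;                        // identifiers, literals, delimiters are dropped
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Tokens for a statement like: total += map.get(key) + helper(x);
        List<Token> tokens = List.of(
            new Token(Kind.OTHER, "total"),
            new Token(Kind.OPERATOR, "+="),
            new Token(Kind.API_METHOD, "java.util.Map.get"),
            new Token(Kind.OPERATOR, "+"),
            new Token(Kind.LOCAL_METHOD, "helper"));
        System.out.println(toMin1Words(tokens));
        // [+=, java.util.Map.get, +, <local-method>]
    }
}
```

Later lexicons in this style would add cases rather than change existing ones, which is what makes the construction additive.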

To try to improve yield, we created lexicon MIN2 by
including control flow keywords as well: there are 13 in
Java. From the programmer’s perspective, these words reveal
a great deal about the structure of a method that is critical
to semantics. For example, the word for alone immediately
tells us that some behavior is repeated. Using this lexicon,
the mean and maximum minset sizes are still small, 2.88 and
9, respectively. The yield does not increase much. Only an
additional 288 methods become threshable. The likeliest and
simplest explanation for the small change is that these words
are very common; at least one of them is present in 83.26%
of the methods. It is more difficult to interpret this change. On
the one hand, it is small. On the other hand, it is the result of
adding only 13 new, semantically-rich words. In balancing the
size of the lexicon with the interpretability of minsets, this appears
to be a good trade-off.

In our quest to improve yield, we defined Min 3 to include 
the types of variable identifiers (names). Those of a public type 
were mapped to their fully qualified type name. Those of a 
locally-defined type were mapped to a single abstract word to 
signal their presence. Locally-defined types, like local methods, 
tend to be project-specific and not of general use. Our reason 
for focusing on types is that they tell the programmer the kind 
of data on which methods and operators act. It is also a simple 
way of considering variable identifiers. Again, the mean and 
maximum minset sizes are small, 2.96 and 9, respectively. There
is a notable increase in the yield, from 29.72% to 41.44%. It is
now close to what we would imagine might be practical. In a
MINSET-based programming model, a programmer would find
4 out of 10 methods. The lexicon also grew substantially, by
36,260 words. This trade-off appears reasonable considering 
as well that it is natural to supply the programmer with the 
convenience of a variety of primitive and composite types. 
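
The mapping described above can be illustrated with a minimal sketch. It is not the tooling used in the study: the marker word `<LOCAL_TYPE>`, the function names, and the `com.example.app` types are hypothetical, and we assume the set of public types is known in advance.

```python
# Hypothetical abstract word that stands in for every locally-defined type.
LOCAL_TYPE_WORD = "<LOCAL_TYPE>"


def type_word(qualified_type, public_types):
    """Map a variable's declared type to a lexicon word.

    A public type keeps its fully qualified name; any locally-defined
    type collapses to one abstract word that only signals its presence.
    """
    return qualified_type if qualified_type in public_types else LOCAL_TYPE_WORD


def thresh_variable_types(declared_types, public_types):
    """Return the set of type words for a method's variable identifiers."""
    return {type_word(t, public_types) for t in declared_types}


# Assumed inputs: the public type set and the declared types of one method.
public = {"java.lang.String", "java.util.List", "int"}
words = thresh_variable_types(
    ["java.lang.String", "com.example.app.Order", "com.example.app.Cart", "int"],
    public,
)
```

Both project-specific types map to the same word, so they add a single entry to the lexicon no matter how many local types a project defines.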

We defined a final lexicon, Min4, which includes false, 
true, and null, object reference keywords, like this and new, 
and the token types of constant values, such as the token type 
Character-Literal for ‘Z’ or, for 5, Integer-Literal. In total, 
we added 13 new words. Our intuition is that the use of hard-coded
strings and numbers is connected to semantics. Certainly,

6 A point is an extreme outlier if it lies beyond Q3 + 3*IQ or below
Q1 - 3*IQ, where IQ = Q3 - Q1.
Fig. 7: Multiplicity: (left) As in Figure 5, as the lexicon grows,
so does the threshed method size. In this case, methods are
much larger because repetition is allowed. (right) The minset
sizes, allowing repetition, are evidently larger. However, on
average, they are still small across all lexicons. (To visualize
both distributions, we omitted extreme outliers; see footnote 6.)
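
The extreme-outlier rule from footnote 6 can be sketched as follows. This is an illustration under one assumption: the quartile convention is whatever Python's `statistics.quantiles` uses by default, which may differ from the one used to draw the figure. The function names and sample sizes are ours.

```python
import statistics


def extreme_outlier_bounds(values):
    """Return (low, high) = (Q1 - 3*IQ, Q3 + 3*IQ) with IQ = Q3 - Q1."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iq = q3 - q1
    return q1 - 3 * iq, q3 + 3 * iq


def drop_extreme_outliers(values):
    """Keep only the points that lie within the extreme-outlier bounds."""
    low, high = extreme_outlier_bounds(values)
    return [v for v in values if low <= v <= high]


# Made-up minset sizes with one very long right-tail value.
sizes = [2, 3, 3, 4, 4, 5, 5, 6, 200]
kept = drop_extreme_outliers(sizes)
```

Only the value 200 falls beyond the upper bound here, so the plotted distribution would keep the other eight points.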


Fig. 8: Multiplicity: (left) Yield: Multiplicity improves the yield 
of all lexicons. The yield of Min4 now exceeds 50%. (right) 
Proportion of Methods With Duplicates: Using this proportion 
as a rough measure of threshing, multiplicity also improves
the threshing precision of each lexicon. Fewer than 25% of the
methods have duplicates using Min4. (Note: Compare with
Figure 6.) 


reading hard-coded values can be informative. Also, in a new 
programming model, a programmer may need to indicate that 
some constant string or number will be used. For example, 
if the programmer wishes to find a method that calculates 
the area of a circle, then it would be natural to indicate 
that the target method likely contains 3.14 or pi. After including
these words, the mean and maximum minset sizes remain small,
3.06 and 10, respectively. The yield increased from 41.44%
to 44.79%. Adding this small number of semantically-rich
words to the lexicon seems to be another reasonable exchange
for a noticeable gain in yield: under this lexicon, the minsets
are easier to interpret (see Section IV-D for our analysis of
the interpretability of minsets built from these words) while
remaining small enough for humans to work with; for example, a
human could potentially write a minset from scratch while
programming using keywords [16].

C. Improving Threshing and Winnowing 

Instead of continuing our search for lexicons generated from 
ever more complex abstractions over lexemes, we reconsidered 
multiplicity, the number of copies of a word in a method. 
We hypothesized that modeling methods as multisets would 
recapture some textual and semantic differences, and thereby 
increase the yield of the lexicons Min 1 through Min 4. We
used the multiset version of Algorithm 1 to recompute minsets, 
and show our results in Figure 7. 
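
To make the multiset view concrete, the following is a minimal greedy sketch. It is not Algorithm 1 itself: it assumes each method has already been threshed into a `Counter` of words, it grows a candidate one copy at a time, and it then prunes copies that turn out to be unnecessary. The names `contains` and `greedy_minset` are hypothetical.

```python
from collections import Counter


def contains(method, subset):
    """True if `method` has at least as many copies of each word as `subset`."""
    return all(method[w] >= k for w, k in subset.items())


def greedy_minset(target, others):
    """Build a small multiset of `target`'s words that no method in `others` contains.

    Returns None when some other method subsumes `target` (not threshable).
    """
    chosen = Counter()
    survivors = list(others)  # methods that still contain `chosen`
    while survivors:
        best_word, best_left = None, None
        for w in target:
            if chosen[w] == target[w]:
                continue  # every copy of w is already used
            need = chosen[w] + 1
            left = sum(1 for o in survivors if o[w] >= need)
            if best_left is None or left < best_left:
                best_word, best_left = w, left
        if best_word is None:
            return None  # all of target is used and it is still subsumed
        chosen[best_word] += 1
        survivors = [o for o in survivors if o[best_word] >= chosen[best_word]]
    # Prune copies that are not needed to keep the subset distinguishing.
    for w in list(chosen):
        while chosen[w] > 0:
            chosen[w] -= 1
            if any(contains(o, chosen) for o in others):
                chosen[w] += 1
                break
    return +chosen  # unary plus drops words whose count fell to zero


a = Counter(["get", "get", "+"])
b = Counter(["get", "+", "+"])
c = Counter(["get", "put"])
minset_a = greedy_minset(a, [b, c])
```

As plain sets, `a` and `b` are both {get, +} and cannot be told apart; as multisets, two copies of `get` are enough to single out `a`, which is the effect multiplicity is meant to recover.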

Multiplicity improved yield at the cost of larger absolute 
minset sizes. The yield increased for all lexicons. The new 
yields ranged from 32.64%-53.63%. The smallest increase 
in yield was using Min 1 (3.18%) and the largest was using
Min4 (8.84%). More concretely, using Min4, the number 
of threshable methods increased by 884. Multiplicity also 
improved the minset ratios over all lexicons. For example, 
using Min4, the mean minset ratio decreased from 15.47% to 
5.35%. The cost of considering multiplicity, however, was an 
overall increase in minset sizes; the range of mean minset sizes
shifted from 2.73-3.06 to 7.06-9.56 and got a bit wider. The
outliers of minset sizes moved farther to the right. Previously, 
they ranged from 7-10 and now they range from 258^138. 
The right tails have grown longer. For example, using Min 4, 
75.67% of the minsets have fewer than 10 words. Another 
cost of the gain in yield was in minset computation where we 
observed an approximate slowdown factor ranging from 4 to 7. 
For example, computing multiset minsets using Min 1 took 44
hours instead of 6. In practice, the slowdown is much better
than Algorithm 1’s complexity implies. Overall, despite its cost,
modeling methods as multisets over Min 4 produces a yield 
with practical value: it easily distinguishes more than half of 
the methods in our sample set. 

Multiplicity also appears to improve how well methods are
threshed. Threshing maps a method to a set (or multiset), and
can map two distinct methods to the same set or multiset. When
this happens, the Minset algorithm cannot distinguish them.
We can use the proportion of methods with duplicates to gauge
the precision of threshing. LEX gave us a baseline of 3.20%.
When we experimented with lexicons Min 1 through Min 4
and no multiplicity, we observed the portion improved from
66.4% using Min 1 down to 41.64% using Min 4 (Figure 8).
Multiplicity cut those portions nearly in half. For example, 
using Min 4, the portion is now 23.59%. 
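
The duplicate proportion used here as a gauge of threshing precision can be sketched as below. This is an illustration only; it assumes each threshed method is available as a list of words, and the function name and toy methods are ours.

```python
from collections import Counter


def duplicate_proportion(threshed, multiplicity=False):
    """Fraction of methods whose threshed form is shared with another method.

    With multiplicity=False a method is reduced to its set of words;
    with multiplicity=True the number of copies of each word also counts.
    """
    def key(words):
        if multiplicity:
            return frozenset(Counter(words).items())
        return frozenset(words)

    groups = Counter(key(words) for words in threshed)
    shared = sum(n for n in groups.values() if n > 1)
    return shared / len(threshed)


# Toy threshed methods: the first two differ only in multiplicity.
methods = [
    ["get", "get", "+"],
    ["get", "+", "+"],
    ["get", "put"],
    ["get", "put"],
]
as_sets = duplicate_proportion(methods)
as_multisets = duplicate_proportion(methods, multiplicity=True)
```

In this toy input every method has a duplicate under the set model, while only the last two remain duplicates once copies are counted.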

The remaining portion of non-threshable methods is intriguing.
There are still 46.37% non-threshable methods, entirely
subsumed by more than 10 other methods. We certainly
expected some methods to subsume others because of their
sheer size. We also expected families of semantically-related
methods where some subsume others. However, given that
methods are not that small, containing, on average, 72.8 words
over Min 4, and that the portion of methods with duplicates
is small, we suspected another reason. We hypothesized that
there are abnormally large methods subsuming a great number
of methods.

Fig. 9: The number of threshable methods increases as the
maximum method size filter is tuned down to 562. From there,
the number of methods and the number of threshable methods
decrease substantially. Thus, setting the filter at 562 seems
appropriate.

We conducted an experiment where we gradually filtered 
large methods to observe the effect on yield (Figure 9). We 
initialized the filter size to 72,028, the maximum method size 
(in tokens) in our corpus, and repeatedly halved it down to 70; 
the minimum size of a method is 50. If we filter methods with 
more than 562 tokens, or about 56 lines of code, then the yield 
improves from 53.63% to 61.74%. This filter means that, in
a new programming model, what the programmer is coding 
would not be compared against abnormally large methods, by 
default. With such a filter, 6 out of 10 methods can be easily 
distinguished from others via their minset. If we doubled the 
filter size, we would reconsider 55,953 methods, and the yield 
would still be higher at 57.32% than without the filter. Since 
there is a relatively low number of these large methods, 69,535 
out of 1,870,905 (or 3.7%), the trade-off seems reasonable. A 
maximum size filter would clearly add practical value in a new 
programming model. 
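
The filtering experiment can be sketched as follows. This is an illustration, not the experiment's code: it assumes integer halving (which does pass through 562 and end at 70 when started from 72,028), uses plain set subsumption rather than multisets, and runs on a made-up three-method corpus.

```python
def halving_thresholds(start, floor):
    """List filter sizes start, start // 2, ... that are still >= floor."""
    sizes = []
    size = start
    while size >= floor:
        sizes.append(size)
        size //= 2
    return sizes


def yield_with_filter(methods, max_size):
    """Share of kept methods that no other kept method subsumes.

    `methods` is a list of (size_in_tokens, frozenset_of_words) pairs;
    methods larger than `max_size` are dropped before comparing.
    """
    kept = [words for size, words in methods if size <= max_size]
    if not kept:
        return 0.0
    threshable = sum(
        1
        for i, m in enumerate(kept)
        if not any(j != i and m <= o for j, o in enumerate(kept))
    )
    return threshable / len(kept)


# Toy corpus: one very large method subsumes two small ones.
toy = [
    (1000, frozenset({"open", "read", "close", "parse"})),
    (10, frozenset({"open", "read"})),
    (12, frozenset({"close", "parse"})),
]
thresholds = halving_thresholds(72028, 70)
unfiltered = yield_with_filter(toy, 72028)
filtered = yield_with_filter(toy, 562)
```

In the toy corpus, removing the single oversized method makes both small methods threshable, which mirrors the hypothesis that a few abnormally large methods subsume many others.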

Min 4 is a natural lexicon suited for code search, synthesis, 
and robust programming. We recomputed minsets using Min 4 
considering multiplicity, and the filter size set to 562. As we 
already mentioned, the yield is 61.74%. The mean minset size 
increases with the filter from 9.56 to 11.03. The minset sizes 
vary but have a clear positive skew where fewer than 25% 
contain more than 12 words. That right tail of the distribution 
is significantly shorter; the maximum size decreased from 689 
to 173 because of the filter. 

D. Minset Case Studies 

Recall that our definition of a Minset, the wheat of a method, 
does not imply that a Minset is unique. Nor does it imply 
that chaff is meaningless, containing little information about 
the method. A Minset is also not executable. To be useful, 
a Minset should capture core, distinguishing functionality in 
a method, and be easily understandable. We studied whether 
this is the case. 


From these case studies, we learned that minsets computed
over LEX are small but do not reveal much about the behavior 
of the method. Minsets over Min 4, on the other hand, are still 
small but also give insight into the functionality of a method. 

Study of LEX Since there are thousands of minsets, we
took a broad view. For all minsets, we partitioned lexemes by
type, leveraging information collected by JavaMT; the types we
defined are similar to lexer token types but broader in some
cases and narrower in others. We provide a list of the lexeme
types we defined, along with the counts of lexemes belonging
to each type, in Table IV (see footnote 7).

Public type variable identifiers, and string and character 
literals dominate minsets. String literals are constant string 
values like "Joda". The strings can represent error or information 
messages, IP addresses, names, pretty much anything. Perhaps
this is why they are at the top of the list: they can be unique or
very rare. We divide certain classes of words depending on whether
they are public or local: method invocations, type identifiers,
and variable names. Public words are more standard and
common whereas local words are more specialized and rare. Not
surprisingly then, we observe that standard language features,
like keywords and operators, and public types and methods
are less common in minsets. The only exceptions are variable
identifiers of public types. Their distinctiveness is due in part
to synonyms and homonyms. A programmer has great freedom
in creating them. For example, dir appears 8017 times as a
variable name in methods, while directory appears only 2774
times. Another reason is that variable identifiers are more
prevalent than other types of identifiers, like types and method
calls.

Study of Min 4 We studied the minsets produced in our 
last experiment in Section IV-C. We selected nine minsets 
(Figure 10); we partitioned the methods into low, medium, 
and high minset ratios and picked three uniformly at random 
from each subset. For each minset, we tried to understand each 
element and what they revealed together about the behavior 
of a method. Then we inspected the method source code 
more carefully to assess how well the minsets capture method 
functionality. Due to lack of space, we discuss only three in 
detail. 

Low: L1 The method named
javax.xml.bind.Unmarshaller.unmarshal(javax.xml.transform.Source)
deserializes XML documents and returns a Java content tree object;
java.awt.Image is an abstract class that represents graphical images.
From this minset, we infer that this method handles images and XML
files. Since it reads the XML file, we also infer that it uses
XML data in some manner. Perhaps the file contains a list of
images, or the data in the file is used to create or alter an
image. After inspecting the source code, we find that it is a
method in the LargeInlineBinaryTestCases class of the EclipseLink
project, which manages XML files and other data stores.
Our understanding was not far off: the method does read a
binary XML file that contains images.

Medium: M1 The java.lang.Class.isInstance(java.lang.Object)
method checks if a given object is an object of type
Class or assignment-compatible with its calling object. The
java.sql.Date.toString() method converts a Date object,

7 A caveat: Algorithm 1 at line 4 picks arbitrarily between two equally rare 
words. Thus, these counts could differ. 
TABLE IV: Types of lexemes (or words) in the minsets we computed over the lexicon LEX.

Grain Type                              Count  Examples
Variable Identifier (of Public Type)    3235   abilityType (java.lang.StringBuffer), defaultValue (int), lostCandidate (boolean), twinsItem (java.util.List)
String and Character Literal            3202   ‘\u203F’, ‘&’, "192.168.1.36", "audit.pdf", "Error: 3", "Joda", "Record Found", "secret4"
Method Call (Local)                     2942   classNameForCode, getInstanceProperty, isUserDefaultAdmin, makeDir, shouldAutoComplete
Variable Identifier (of Local Type)     1574   arcTgt, component, iVRPlayPropertiesTab, nestedException, this_TemplateCS_l, wordFSA
Type Identifier (a Local Type)          1413   ErrorApplication, IWorkspaceRoot, Literals, NNSingleElectron, PickObject, TrainingComparator
Method Call (a Public Method)           508    currentTimeMillis (java.lang.System.currentTimeMillis()), replace (java.lang.String.replace(char,char))
Number Literal (integer, float, etc.)   310    0, 1, 3, 150, 2010, 0xD0, 0x017E, 0x7bcdef42, 255.0f, 0x1000000000041L, 46.666667
Type Identifier (a Public Type)         265    int, ArrayList, Collection, IllegalArgumentException, PropertyChangeSupport, SimpleDateFormat
Operator                                260    ^=, <, <<=, <=, =, ==, >, >=, >>, >>=, >>>=, |, |=, ||, -=, -, !, !=, ?, /, /=, @, *, &, &&, +, +=, ++
Keyword (Except Types)                  196    break, catch, do, else, extends, final, finally, for, instanceof, new, return, super, synchronized, this, try, while
Separator                               148    <, >, ", ", ., ]
Reserved Words (Literals)               104    false, null, true
Other                                   112    COLUMNNAME_PostingType, E, ec2, element, ModelType, org, T, TC


ID   MinSet (Min 4)   Ratio
L1   java.awt.Image; javax.xml.bind.Unmarshaller.unmarshal(javax.xml.transform.Source)   2.53%
L2   javax.swing.DefaultBoundedRangeModel2Test.checkValues(javax.swing.BoundedRangeModel,int,int,int,int,boolean)   2.04%
L3   […]; java.text.Bidi.getRunLevel(int)   4.55%
M1   java.lang.Class.isInstance(java.lang.Object); java.sql.Date.toString()   12.5%
M2   java.security.AccessController.<java.lang.Object>doPrivileged(java.security.PrivilegedAction<java.lang.Object>); javax.security.auth.Policy.getPermissions(javax.security.auth.Subject, java.security.CodeSource)   12.5%
M3   @; java.sql.PreparedStatement.setByte(int,byte)   12.5%
H1   = (x2); java.lang.Exception; java.security.Security.addProvider(java.security.Provider); super   23.8%
H2   […]; org.eclipse.linuxtools.tmf.core.trace.TmfExperiment<LTYPE> (x3)   27.8%
H3   com.sun.javadoc.ClassDoc (x3); java.lang.String[]; java.lang.String.equals(java.lang.Object)   31.3%

Fig. 10: This shows the minsets of nine methods (Min 4). L1-L3
are minsets that have low minset ratios. M1-M3 have
medium minset ratios. H1-H3 have high minset ratios. The
minset elements are rich and reveal some information about
the behavior of their respective methods.


which has been wrapped as an SQL date value, to a String. 
From this minset, we understand the type of a variable is 
checked. Perhaps, reflection is used on an object to ensure it 
is an instance of type Date before it is converted to a string, for
printing or storage. Inspecting the source code we find that this 
method resides in the DateType class of the Hibernate ORM 
project. Again, our understanding is very close to the behavior 
of the method. The method is passed an object, which it 
ensures is a java.sql.Date class object, and then returns the 
value as a string in the appropriate SQL dialect. 

High: H1 The java.lang.Exception object is thrown in Java to
indicate abnormal flow or behavior. The = operator tells us that
there is an assignment but is very common. The
java.security.Security.addProvider(java.security.Provider) method adds a
security service object, Provider, to a Security object. The 
Security object centralizes all the security properties in an 
application. The super keyword refers to the superclass. From 
this minset, we can infer that it describes a constructor that 
probably overrides a method in its superclass. We also infer 
that it catches an exception when adding the provider fails. 


In the source, we confirm that it is a constructor in the 
HsqlSocketFactorySecure class in the CloverETL project. It 
wraps code that instantiates a Provider class and adds it to 
the Security object in a try block. If adding the provider fails, 
it catches the exception, as we had inferred. 

V. Applications 

Though our study is primarily empirical, in this section, we 
describe existing and new applications for minsets. 

SmartSynth (Existing) As we mentioned earlier, the clearest 
and, perhaps, most promising application for minsets is in 
keyword-based programming. SmartSynth [14] is a recent, 
modern incarnation. SmartSynth generates a smartphone script 
from a natural language description (query). “Speak weather in 
the morning” is an example of a successful query. SmartSynth 
uses NLP techniques to parse the query and map it to a set of 
“components” (words) in its underlying programming language. 
Combining a variety of techniques, it then infers relationships 
between the words to generate and rank candidate scripts. 
At its heart is the idea that usable code can be constructed 
from a small set of words. This subset is a minset or another 
distinguishing subset. 

Code Search Engine (New) A major problem of code search 
is ranking results [2, 18, 21]. We built a code search engine 
that uses a new ranking scheme (see footnote 8). Relevant methods are
ranked by the similarity between their minsets and the user’s query.
For example, the query “sort array int” returns 135 methods.
The top result, with minset “sort array parseInt 16”, returns a
sorted array of integers if the ‘sort’ flag is set.
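
A ranking of this kind can be sketched as below. The similarity measure is an assumption: we use Jaccard similarity over lowercased words, which is not necessarily what the search engine uses, and the method names in the toy index are hypothetical. Only the query and the first minset come from the example above.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of words (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_methods(query, minset_index):
    """Return method names ordered by minset/query similarity, best first.

    `minset_index` maps a method name to the words of its minset.
    Methods that share no word with the query are left out.
    """
    query_words = query.lower().split()
    scored = [
        (jaccard(query_words, [w.lower() for w in words]), name)
        for name, words in minset_index.items()
    ]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ranked if score > 0]


# Toy index; method names are made up for illustration.
index = {
    "sortIfFlagged": ["sort", "array", "parseInt", "16"],
    "reverseArray": ["array", "reverse"],
    "md5Digest": ["md5", "digest"],
}
results = rank_methods("sort array int", index)
```

Because minsets are small, a set-overlap score of this kind stays cheap to compute even over a large index.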

Code Summarizer (New) From our case studies of Min4 
minsets, we realized that minsets can effectively summarize 
code. We built a code summary web application (see footnote 8). A user enters
the source code of a method, our tool computes a minset, and 
presents it as a concise summary. Due to space constraints, we 
omit a full example and invite interested readers to explore 
our web application. Figure 10 shows examples of minsets 
summarizing methods. 

VI. Discussion 

The main purpose of this study was to test our “wheat and 
chaff” hypothesis. We have shown, over a variety of lexicons, 
that functions can be identified by a subset of their words, 
that those subsets tend to be very small, and suggested a 

8 http://jarvis.cs.ucdavis.edu/code_essence.


lexicon, Min 4, that induces those minsets to be more natural 
and meaningful. Thus, our results clearly support our “wheat 
and chaff" hypothesis. 

Our results offer insight into how to develop powerful, alternative
programming systems. Consider an integrated development
environment (IDE), like Eclipse or IntelliJ, that can search
a MINSET-indexed database of code and requirements to
1) propose related code that may be adapted to purpose,
2) auto-complete whole code fragments as the programmer works,
3) speed concept location for navigation and debugging, and
4) support traceability by interconnecting requirements and
code [6].

Other Lexicons Our lexicon exploration avoided variable 
names because they are so unconstrained, noisy, and rife 
with homonyms and synonyms. Minsets over lexicons, like 
LEX, that incorporated them could include trivial, semantically 
insignificant differences, like user vs. usr in Unix. At the same 
time, variable names are an alluring source of signal. Intuitively, 
and in this corpus, they are the largest class of identifiers, which 
comprise 70% of source code [8], and connect a program’s 
source to its problem domain [4]. In future work, we plan to 
separate the “wheat from the chaff" in variable names. 

Alternatives to Functions We chose functions as our semantic 
unit of discourse. However, we can apply the same methodology 
at other semantic levels. One alternative is to study blocks of 
code. A single function can have many blocks. This could 
be very useful in alternative programming systems where the 
user seeks a common block of code but for which there is 
no individual function. Another alternative is to use abstract 
syntax trees (AST). 

Threats to Validity We identify two main threats. The first 
is that we only studied Java. However, we have no reason to 
believe that the “wheat and chaff" hypothesis does not hold 
for other programming languages. Java, though more modern, 
was designed to be very similar to C and C++ so that it could 
be adopted easily. The second threat comes from our corpus: 
size and diversity. We downloaded a very large corpus, by any 
standard. In fact, we downloaded all the Java projects listed as 
“Most Popular" in the four code repositories we crawled. Those 
code repositories are known primarily for hosting open-source 
projects. Thus, there is no indication that they are biased toward 
any specific types of projects. We plan to replicate this study on 
a larger Java corpus and with languages of different paradigms,
like Lisp and Prolog, to help us understand to what extent the
“wheat and chaff” phenomenon varies.

VII. Related Work 

Although we are the first to study the phenomenon of “wheat”
and “chaff” in code (see footnote 9), a few strands of related work exist.

Code Uniqueness At a basic level, our study is about 
uniqueness. Gabel and Su also studied uniqueness [9]. They 
found that software generally lacks uniqueness, which they measure
as the proportion of unique, fixed-length token sequences
in a software project. We studied uniqueness differently. We
captured the distinguishing core semantics (the essence) of
a piece of code in a unique subset of syntactic features, a
MINSET, whose elements may not be unique or even rare
but together uniquely identify a piece of code. We keep in 

9 Others have used the “wheat and chaff” analogy in the computing world
but in different domains [29, 30].


mind that syntactic differences do not always imply functional 
differences as Jiang and Su demonstrated [13]. Thus, in some 
cases two minsets may represent the same high-level behavior. 

Code Completion and Search Observations about natural
language phenomena provide a promising path toward making
programming easier. Hindle et al. focused on the ‘naturalness’
of software [12]. They showed that actual code is “regular and
predictable”, like natural language utterances. To do so, they
trained an n-gram model on part of a corpus, and then tested
it on the rest. They leveraged code predictability to enhance
Eclipse’s code completion tool. Their work followed that of
Gabel and Su who posited and gave supporting evidence that
we are approaching a ‘singularity’, a point in time where all
the small fragments of code we need to write already exist [9].
When that happens, many programming tasks can be reduced
to finding the desired code in a corpus. Our work suggests that
a small, natural set of words, captured in a MINSET, can index
and retrieve code. As for code completion, a MINSET-based
approach could exploit not just the previous n - 1 tokens, but
all the previous tokens, and complete not just the next token
but whole pieces of code.

Sourcerer and Portfolio, two modern code search engines,
support basic term queries, in addition to more advanced
queries [2, 20]. Our research suggests that the natural and efficient
term query is a MINSET. Results may differ in granularity.
Portfolio focuses on finding functions [20] while Exemplar,
another engine, finds whole applications [11]. MINSET easily
generalizes to arbitrary code fragments. Finally, code search
must also be ‘internet-scale’ [10], and with a modest computer,
we can compute minsets for corpora of code of various
languages, and update them regularly as new code is added.

Code completion tools suggest code a programmer might 
want to use. They infer relevant code and rank it. Many diverse, 
useful tools and strategies exist [5, 24, 25, 32]. Our work 
suggests a different, complementary MINSET-based strategy: if
what the programmer is coding contains the MINSET of some
piece of code, suggest that.
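
The containment test behind this strategy can be sketched as follows. It is a hedged illustration, not an existing tool: the index contents and names are hypothetical, and we assume both the partially written code and the indexed minsets have been threshed into words of the same lexicon.

```python
from collections import Counter


def suggest(typed_words, minset_index):
    """Return names of indexed code whose minset the typed words contain.

    Containment respects multiplicity: the typed code must hold at
    least as many copies of each word as the minset does.
    """
    typed = Counter(typed_words)
    return sorted(
        name
        for name, minset in minset_index.items()
        if all(typed[w] >= k for w, k in Counter(minset).items())
    )


# Toy index of minsets; names and words are made up for illustration.
index = {
    "readLines": ["BufferedReader", "readLine"],
    "copyFile": ["FileInputStream", "FileOutputStream"],
}
typed = ["new", "BufferedReader", "while", "readLine", "!=", "null"]
suggestions = suggest(typed, index)
```

Unlike an n-gram predictor, this check looks at everything typed so far and proposes whole pieces of code rather than the next token.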

Genetics and Debugging At a high level, Algorithm 1 isolates
a minimal set of essential elements. Central to synthetic biology
is the search for the ‘minimal genome’, the minimal set of
genes essential to living organisms [1, 19]. Delta debugging
is very similar in that it finds a minimal set of lines of code
that trigger a bug [7]. Both approaches rely on an oracle that
defines what is ‘essential’ whereas we define ‘essentialness’ 
with respect to other sets. 

VIII. Conclusion and Future Work 

We imagine that code, to the human mind, is amorphous, and 
ask: “If a programmer were reading this code, what features 
would be semantically important?” and “If a programmer were 
trying to write this piece of code, what key ideas would the 
programmer communicate?” A MINSET is our proposal of a
useful, formal definition of these key ideas as ‘wheat.’ Our 
definition is constructive, so a computer can compute Minsets 
to generate or retrieve an intended piece of code. 

We evaluated Minsets, over a large corpus of real-world 
Java programs, using various, natural lexicons: the computed 
minsets are sufficiently small and understandable for use in 
code search, code completion, and natural programming. 